Use hash join when writing sparkey #5402

aslotnick · 2024-06-24T21:10:34Z

When writing to sparkey, allShards represents every expected shard, even when shards contains no data for a given shard number.

shards.rightOuterJoin(allShards) (added in #5208) fails when a single shard contains a large amount of data, producing the error described in #5300: java.lang.OutOfMemoryError: Required array length 2147483639 + 15534 is too large.

This PR replaces rightOuterJoin with hashFullOuterJoin (note that there is no hashRightOuterJoin implementation). A hash join is a good fit because the right-hand side contains very little data (only the shard keys), and the join does not need an array to hold the values of the large left-hand side. As a result, some workflows that succeeded in Scio 0.13.* and have been failing since will run successfully again.
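
To illustrate why a hash join avoids the oversized array, here is a minimal sketch written against plain Scala collections (2.13+). It is not Scio's implementation: the object HashJoinSketch, its hashFullOuterJoin method, and the parameter names are hypothetical stand-ins. The small right-hand side is held in a map while the large left-hand side is streamed one record at a time, so no per-key buffer of left-hand values is built.

```scala
object HashJoinSketch {
  // Hypothetical stand-in for a hash full outer join. Plain iterators and
  // sequences are used here in place of Scio's SCollection.
  def hashFullOuterJoin[K, V, W](
      left: Iterator[(K, V)], // large side, streamed record by record
      right: Seq[(K, W)]      // small side, held in memory as a hash map
  ): Iterator[(K, (Option[V], Option[W]))] = {
    type Row = (K, (Option[V], Option[W]))

    val rightMap: Map[K, Seq[W]] = right.groupMap(_._1)(_._2)
    // Right-hand keys that matched at least one left-hand record.
    val matched = scala.collection.mutable.Set.empty[K]

    // Each left-hand record is joined on its own; values are never grouped.
    val streamed: Iterator[Row] = left.flatMap { case (k, v) =>
      rightMap.get(k) match {
        case Some(ws) =>
          matched += k
          ws.iterator.map[Row](w => (k, (Some(v), Some(w))))
        case None =>
          Iterator.single[Row]((k, (Some(v), None)))
      }
    }

    // Right-hand keys with no left-hand data, e.g. expected but empty shards.
    // Evaluated only after the left-hand side has been fully consumed.
    def unmatched: Iterator[Row] =
      rightMap.iterator
        .filter { case (k, _) => !matched(k) }
        .flatMap { case (k, ws) => ws.iterator.map[Row](w => (k, (None, Some(w)))) }

    streamed ++ unmatched
  }
}
```

In the sparkey case, the left-hand side would be the shard data and the right-hand side one entry per expected shard number. Because allShards covers every expected shard, the left-only case (Some, None) should not arise, which is presumably why a full outer join can stand in for the missing right outer variant.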

shnapz

allShards should fit into memory, given that numShards is a Short.

kellen

Hey, thanks for looking into this; I was about to start investigating myself. This seems like it should work and still sidestep the issues fixed in #5208.

codecov · 2024-07-08T13:24:49Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.26%. Comparing base (6d755f9) to head (bfcec57).
Report is 15 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5402      +/-   ##
==========================================
+ Coverage   61.22%   61.26%   +0.03%     
==========================================
  Files         310      310              
  Lines       11061    11061              
  Branches      755      755              
==========================================
+ Hits         6772     6776       +4     
+ Misses       4289     4285       -4

Use hash join when writing sparkey

bfcec57

shnapz approved these changes Jun 24, 2024

kellen approved these changes Jun 25, 2024

RustedBones merged commit c915a53 into spotify:main Jul 8, 2024
11 checks passed

aslotnick deleted the as/sparkey-join-fix-1 branch July 8, 2024 23:14